Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations

نویسندگان

  • John Wieting
  • Kevin Gimpel
چکیده

We extend the work of Wieting et al. (2017), back-translating a large parallel corpus to produce a dataset of more than 51 million English-English sentential paraphrase pairs in a dataset we call PARANMT-50M. We find this corpus to be cover many domains and styles of text, in addition to being rich in paraphrases with different sentence structure, and we release it to the community. and release it to the community. To show its utility, we use it to train paraphrastic sentence embeddings using only minor changes to the framework of Wieting et al. (2016b). The resulting embeddings outperform all supervised systems on every SemEval semantic textual similarity (STS) competition, and are a significant improvement in capturing paraphrastic similarity over all other sentence embeddings We also show that our embeddings perform competitively on general sentence embedding tasks. We release the corpus, pretrained sentence embeddings, and code to generate them. We believe the corpus can be a valuable resource for automatic paraphrase generation and can provide a rich source of semantic information to improve downstream natural language understanding tasks.1

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Learning Paraphrastic Sentence Embeddings from Back-Translated Bitext

We consider the problem of learning general-purpose, paraphrastic sentence embeddings in the setting of Wieting et al. (2016b). We use neural machine translation to generate sentential paraphrases via back-translation of bilingual sentence pairs. We evaluate the paraphrase pairs by their ability to serve as training data for learning paraphrastic sentence embeddings. We find that the data quali...

متن کامل

Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings

We consider the problem of learning general-purpose, paraphrastic sentence embeddings, revisiting the setting of Wieting et al. (2016b). While they found LSTM recurrent networks to underperform word averaging, we present several developments that together produce the opposite conclusion. These include training on sentence pairs rather than phrase pairs, averaging states to represent sequences, ...

متن کامل

Towards Universal Paraphrastic Sentence Embeddings

We consider the problem of learning general-purpose, paraphrastic sentence embeddings based on supervision from the Paraphrase Database (Ganitkevitch et al., 2013). We compare six compositional architectures, evaluating them on annotated textual similarity datasets drawn both from the same distribution as the training data and from a wide range of other domains. We find that the most complex ar...

متن کامل

A Comparison of Methods for Identifying the Translation of Words in a Comparable Corpus: Recipes and Limits

Identifying translations in comparable corpora is a challenge that has attracted many researchers since a long time. It has applications in several applications including Machine Translation and Cross-lingual Information Retrieval. In this study we compare three state-of-the-art approaches for these tasks: the so-called context-based projection method, the projection of monolingual word embeddi...

متن کامل

Using Word Embeddings to Enforce Document-Level Lexical Consistency in Machine Translation

We integrate newmechanisms in a document-level machine translation decoder to improve the lexical consistency of document translations. First, we develop a document-level feature designed to score the lexical consistency of a translation. This feature, which applies towords that have been translated into different forms within the document, uses word embeddings to measure the adequacy of each w...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1711.05732  شماره 

صفحات  -

تاریخ انتشار 2017